# ik_llama.cpp

## Docs

- [Building from source](https://mintlify.wiki/ikawrakow/ik_llama.cpp/building.md): Build ik_llama.cpp for CPU, CUDA, Metal, ROCm, and other backends
- [Android deployment](https://mintlify.wiki/ikawrakow/ik_llama.cpp/deployment/android.md): Run ik_llama.cpp on Android using Termux or the Android NDK
- [Docker deployment](https://mintlify.wiki/ikawrakow/ik_llama.cpp/deployment/docker.md): Run ik_llama.cpp in a Docker or Podman container
- [Performance troubleshooting](https://mintlify.wiki/ikawrakow/ik_llama.cpp/deployment/performance-tips.md): Diagnose and improve token generation speed
- [FlashMLA](https://mintlify.wiki/ikawrakow/ik_llama.cpp/features/flash-mla.md): Optimized Multi-Head Latent Attention for DeepSeek models on CPU and CUDA
- [Function calling](https://mintlify.wiki/ikawrakow/ik_llama.cpp/features/function-calling.md): Use OpenAI-style tool calling with any model via Jinja templates
- [Multimodal (vision)](https://mintlify.wiki/ikawrakow/ik_llama.cpp/features/multimodal.md): Run vision-language models with image input using llama-mtmd-cli and llama-server
- [Speculative decoding](https://mintlify.wiki/ikawrakow/ik_llama.cpp/features/speculative-decoding.md): Accelerate token generation with draft models, n-gram caches, and ngram-mod
- [GPU offloading](https://mintlify.wiki/ikawrakow/ik_llama.cpp/inference/gpu-offload.md): Configure GPU offloading to maximize inference performance with CUDA
- [Hybrid CPU/GPU inference](https://mintlify.wiki/ikawrakow/ik_llama.cpp/inference/hybrid-cpu-gpu.md): Run large models that don't fit in VRAM using RAM+VRAM hybrid offloading
- [Parameters reference](https://mintlify.wiki/ikawrakow/ik_llama.cpp/inference/parameters.md): Complete reference for ik_llama.cpp command-line parameters
- [Running the server](https://mintlify.wiki/ikawrakow/ik_llama.cpp/inference/server.md): Start the llama-server for OpenAI-compatible LLM inference with a built-in WebUI
- [Introduction](https://mintlify.wiki/ikawrakow/ik_llama.cpp/introduction.md): What is ik_llama.cpp and how does it differ from llama.cpp?
- [Importance matrix (imatrix)](https://mintlify.wiki/ikawrakow/ik_llama.cpp/quantization/imatrix.md): Generate and use importance matrices to improve quantization quality
- [IQK quantization types](https://mintlify.wiki/ikawrakow/ik_llama.cpp/quantization/iqk-quants.md): State-of-the-art IQK quantization formats exclusive to ik_llama.cpp
- [Quantization overview](https://mintlify.wiki/ikawrakow/ik_llama.cpp/quantization/overview.md): Understanding quantization types in ik_llama.cpp: IQK, Trellis, legacy k-quants, and more
- [Trellis quantization](https://mintlify.wiki/ikawrakow/ik_llama.cpp/quantization/trellis-quants.md): IQ1_KT, IQ2_KT, IQ3_KT, IQ4_KT: novel integer trellis-based quantization for extreme compression
- [Quickstart](https://mintlify.wiki/ikawrakow/ik_llama.cpp/quickstart.md): Get ik_llama.cpp running in minutes on CPU or GPU
- [Build options](https://mintlify.wiki/ikawrakow/ik_llama.cpp/reference/build-options.md): CMake flags and environment variables for building ik_llama.cpp
- [llama-server reference](https://mintlify.wiki/ikawrakow/ik_llama.cpp/reference/cli-server.md): CLI flags for the llama-server inference server
- [CLI tools reference](https://mintlify.wiki/ikawrakow/ik_llama.cpp/reference/cli-tools.md): llama-cli, llama-quantize, llama-imatrix, llama-bench, llama-sweep-bench
- [Model formats and conversion](https://mintlify.wiki/ikawrakow/ik_llama.cpp/reference/model-formats.md): GGUF format, model splits, and HuggingFace conversion
- [Supported models](https://mintlify.wiki/ikawrakow/ik_llama.cpp/reference/supported-models.md): Model families supported by ik_llama.cpp